Acknowledgements

Material created by Peter Mac Data Science.

Objectives

  • Demonstrate how to create stripcharts with ggplot2’s geom_jitter()
  • Demonstrate the use of faceting
  • Demonstrate using tidyr’s gather() to convert from wide to long (tidy) format
  • Demonstrate dplyr’s select() to choose columns and filter() to choose rows
  • Demonstrate use of dplyr’s case_when() instead of multiple ifelse()s
  • Demonstrate the powerful pipe %>%

Introduction

In this tutorial, we will learn some R through creating stripcharts to visualise results from an RNA-seq experiment.

Stripcharts can be used to visualise the expression of samples by groups. They are used in RNA-seq analyses to check the top differentially expressed genes (or favourite genes). They enable us to see if the replicates within groups have similar expression values and to compare expression values between groups. In RNA-seq we plot the normalised count values.

RNA-seq dataset

We will again use the published RNA-seq data from the Nature Cell Biology paper by Fu et al. 2015. This study examined expression in basal and luminal cells from mice at different stages (virgin, pregnant and lactating).

Here we will work with the normalised counts for all 12 samples.

First we will load in the libraries we need. These are already installed for you but if you need to install it on your own computer you can use install.packages('tidyverse').

library(tidyverse)
norm_counts <- read_tsv("/training/r-intro-tidyverse/data/limma-voom_normalised_counts.tsv.gz")

Let’s take a look at the norm_counts data.

norm_counts

How many rows and columns are in norm_counts?

Stripcharts of multiple genes

We can make stripcharts to view the expression of multiple genes. Let’s plot the expression of the top 10 differentially expressed genes in luminal cells from the pregnant mice versus the luminal cells from the lactating mice. In the volcano plot tutorial we showed how to get these top 10 gene symbols. Here we will show how to create an object with the symbols manually. Remember, in R we use c() to combine multiple values.

top10_syms <- c("Csn1s2b", "Slc25a1", "Slc34a2", "Atp2b2", "Acacb", "Slc30a2", "Elovl5", "Egf", "Ceacam10", "Pmvk")

Then we extract the normalised counts information for these genes from the norm_counts file.

Filtering rows with filter()

We can use dplyr’s filter() to filter rows. Let’s take a look at how filter() works.

To use filter() we specify the data and the column(s) we want to use to filter with our criteria.

If we wanted to filter for rows (genes) in the MCL1.DG sample that have expression above a threshold (e.g. 5) we would write below.

filter(norm_counts, MCL1.DG > 5)

If we wanted to filter for genes that have an expression value above 5 in both the basal virgin samples (MCL1.DG and MCL1.DH) we specify both columns and the criteria. Note that we need to specify the threshold for each column e.g MCL1.DG > 5 & MCL1.DH > 5 and not MCL1.DG & MCL1.DH > 5.

filter(norm_counts, MCL1.DG > 5 & MCL1.DH > 5)

To filter this dataset for the rows that contain the Csn1s2b gene we would write below. Note that we use a == when we want to test if a value matches exactly.

filter(norm_counts, SYMBOL == "Csn1s2b")

If we wanted to filter for 2 genes we could write that as below. Here we use “|” which means or as we want the row if the SYMBOL column contains Csn1s2b or Slc25a1.

filter(norm_counts, SYMBOL == "Csn1s2b" | SYMBOL == "Slc25a1")

Exercise

  • Try to filter for all rows containing casein in the GENENAME column. Hint: Take a look at the help for str_detect() a function from the tidyverse stringr package.

If we want to filter the rows for our top 10 genes. We use SYMBOL %in% top10_syms which means we want all the rows where the SYMBOL column contains one of the gene symbols in top10_syms. Remember, in R we use %in% to test if a value is in a set of values.

top10_counts <- filter(norm_counts, SYMBOL %in% top10_syms)

Take a look.

top10_counts

Selecting columns with select()

We don’t need the ENTREZID and GENENAME columns, we are only going to use the SYMBOL column and the counts so we can use select() to remove these columns.

filter() is used to choose rows, to choose columns we use select(). Let’s take a look at how select() works.

If we wanted to select the gene symbol column we would write below.

select(top10_counts, SYMBOL)

If we wanted the SYMBOL column and the MCL1.DG sample column we would write below.

select(top10_counts, SYMBOL, MCL1.DG)

We can select column ranges with :.

select(top10_counts, ENTREZID:GENENAME)

There are also useful helper functions you can use inside select(), such as contains(), starts_with() and ends_with().

Exercise

  • Try to select all (and only) the counts columns. Hint: There is more than one way to do it.

We can also use select() to remove columns by specifying a “-”" before the column name(s).

Let’s remove the ENTREZID and GENENAME columns so we only keep the SYMBOL column and the sample expression values.

top10_counts <- select(top10_counts, -ENTREZID, -GENENAME)
top10_counts

The base R way of selecting columns

As mentioned in the tidyverse tutorial here, in base R we can select columns using $ or [], for example, to select the SYMBOL column we could use top10_counts$SYMBOL or top10_counts[, "SYMBOL"]. The $ operator works well for single columns, but for multiple columns it quickly starts to get cumbersome as we need to use the [] operator and c() for combining the required columns. The column names also need quotation marks. For example, to access both the SYMBOL and GENENAME columns we would use top10_counts[, c("SYMBOL", "GENENAME")] whereas with tidyverse’s dplyr it is a lot more intuitive select(top10_counts, SYMBOL, GENENAME).

Converting wide into tidy format with gather()

To more easily plot with ggplot2 we need to change the data into “tidy data”. There are 3 rules of tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

In our expression table there should be just one column for all expression values instead of multiple columns. We can use tidyr’s gather() to easily change the format.

The SYMBOL information is already in a column so we use “-” to say leave that column alone. Then gather() will reformat all the other columns (the expression values) into two columns “key” and “value”. The “key” column will contain the column names (our sample ids), and the “value” column will contain the values from the columns (our expression values). We will give the key column the name “Sample” and the value column the name “Norm_counts”.

# change to tidy data format (all expression values in one column)
top10_counts <- gather(top10_counts, key = Sample, value = Norm_counts, -SYMBOL)

Take a look.

top10_counts

Take a closer look with View()

View(top10_counts)

Exercise

  • Try running gather() on the norm_counts object and save it as an object called testing (i.e. run testing <- gather(norm_counts). What do you think of the output? Can you improve it so that there is a column with sample ids and a column with counts.
  • spread() is the opposite of gather(). Try running spread() on the top10_counts object and see if you can regenerate the table with samples in columns.

Testing multiple conditions with case_when()

We want to plot and compare the expression in the groups, so we use mutate()to add a column called “Group” to say what group each sample belongs to. Here we use dplyr’s case_when() to say if the same id matches are conditions then add the appropriate group name into the Group column. We could use ifelse() as we did in the volcano plot tutorial. However, when there are many conditions to test, as there are here, case_when() is easier to use.

top10_counts <- mutate(top10_counts, Group=case_when(
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       ))

Take a look at the data again with View()

View(top10_counts)

Exercise

  • Try to use case_when() to add a column called CellType. This column should contain the value basal if the Group column contains the word basal, or luminal if the Group contains luminal. Save it as an object called testing. Hint: Use str_detect() inside case_when().

Creating stripcharts with geom_jitter()

Now we can make a stripchart. We plot the Group on the X axis and the Norm_counts on the y axis. We will use + geom_jitter() to create a jitter plot. A jitter plot is similar to a scatter plot but adds a small amount of random variation to the location of each point so they don’t overlap. Why do we not just use a scatter plot? Let’s take a look.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_point()

Some of the points are overlapping so we use geom_jitter() to avoid having overlapping points.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter()

Creating stripcharts for each gene with facet_wrap()

The points are no longer overlapping, however, this is all the genes in one plot. ggplot2 has a really useful feature called faceting that we can use. facet_wrap() will create plots for every value in a column in our data. We would like a stripchart of expression values for each gene so we add + facet_wrap(~SYMBOL) to say we want to a plot for each value in the SYMBOL column. There is also facet_grid() which is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL)

We can change the number of rows (nrows) or columns (ncols) to balance the plot. Let’s try 2 rows (nrow=2).

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)

We can add col=Group to say we want to colour by the groups (the Group column we added).

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, col=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)

We can add + theme(axis.text.x = element_text(angle = 90) to make the x axis labels vertical so they don’t overlap.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, col=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 

Ordering categories along an axis

The groups have been plotted in alphabetical order on the x axis, however, we may want to change the order. We may prefer to plot the groups in order of stage, for example, basal virgin, basal pregnant, basal lactate, luminal virgin, luminal pregnant, luminal lactate. In the volcano plot tutorial we showed how to change the order of items in the legend with breaks= into the scale layer. Let’s try that here.

First let’s make an object with the group order that we want.

group_order <- c("basal virgin", "basal pregnant", "basal lactate", "luminal virgin", "luminal pregnant", "luminal lactate")

Then let’s add this group_order into breaks=.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, col=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_colour_discrete(breaks=group_order)

That reordered the legend but notice that it didn’t reorder the groups on the x axis. If we want to reorder groups along an axis with ggplot we need to make the column with the groups into an R data type called a factor. Factors in R are used to specify categories.

We’ll add another column called “Group_f” and make the Group into a factor and specify what order we want the levels (names of the categories).

top10_counts <- mutate(top10_counts, Group_f=factor(Group, levels=group_order))

Take a look with View().

View(top10_counts)

The Group and the Group_f column look the same but take a look by typing top10_counts.

top10_counts

Notice that the Group column has under the heading, it is a character data type, while the Group_f column has under the heading, it is a factor data type. The str() command is useful to check the data types in objects.

str(top10_counts)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   120 obs. of  5 variables:
 $ SYMBOL     : chr  "Slc25a1" "Pmvk" "Egf" "Slc30a2" ...
 $ Sample     : chr  "MCL1.DG" "MCL1.DG" "MCL1.DG" "MCL1.DG" ...
 $ Norm_counts: num  4.881 3.618 -0.89 0.951 -0.229 ...
 $ Group      : chr  "basal virgin" "basal virgin" "basal virgin" "basal virgin" ...
 $ Group_f    : Factor w/ 6 levels "basal virgin",..: 1 1 1 1 1 1 1 1 1 1 ...

str() shows us Group_f column is a Factor with 6 levels. To see the levels we can use levels(). To check the levels of the Group_f columns we could use select() to select the Group_f column followed by pull() to pull the values out of column format.

levels(pull(select(top10_counts, Group_f)))
[1] "basal virgin"     "basal pregnant"   "basal lactate"    "luminal virgin"   "luminal pregnant" "luminal lactate" 

However, in this case it’s simpler to use the base R method, as we can use top10_counts$Group_f to access the values in the Group_f column directly.

levels(top10_counts$Group_f)
[1] "basal virgin"     "basal pregnant"   "basal lactate"    "luminal virgin"   "luminal pregnant" "luminal lactate" 

We can then change our plot to use the “Group_f” column instead of Group (change x= and col=).

ggplot(data=top10_counts, mapping=aes(x=Group_f, y=Norm_counts, col=Group_f)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 

Now both the legend and the x axis have the groups in the order that we want.

These are the top genes in the comparison of luminal cells from pregnant vs lactating mice and this type of plot enables us to see what the expression values look like in all the groups. We can see that some genes, such as Pmvk, have more similar expression across all the groups than others, such as Csn1s2b.

Notice that the genes have also been plotted in alphabetical order in the facets. If we wanted to plot these genes in the order of most signficant, then we need to make symbol column into a factor as we did for the groups.

Exercise

  • Plot the genes in order of the significance. Hint: Use mutate to add a column called SYMBOL_f containing SYMBOL as a factor with the levels in the order in top10_syms. Then remke the plot using SYMBOL_f in facet_wrap instead of SYMBOL.

Creating workflows using %>%

One of the most useful and powerful things about the tidyverse is dplyr’s pipe operator %>%. This enables you to chain commmands together into workflows. For example, we could chain together the commands we ran earlier on our top10_counts object. Note that like the + we use to add layers to a ggplot, %>% goes at the end of the line and this ‘pipes’ the output into the next command. If we use the %>% we also don’t need to specify the data inside the individual commands, we only need to specify it at the beginning e.g. instead of filter(norm_counts, SYMBOL %in% top10_syms) we use norm_counts %>% filter(SYMBOL %in% top10_syms). We can also pipe outputs directly into ggplot(). The pipe is very useful but some advice on when not to use the pipe is provided by the tidyverse creator Hadley Wickham in the excellent R for Data Science book.

norm_counts %>%
  filter(SYMBOL %in% top10_syms) %>%                             # filter rows for the top 10 genes
  select(-ENTREZID, -GENENAME) %>%                               # remove ENTREZID and GENENAME columns
  gather(Sample, Norm_counts, -SYMBOL) %>%                       # convert from wide to long (tidy) format
  mutate(Group=case_when(                                        # add a column to specify groups
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       )) %>%
  mutate(Group_f=factor(Group, levels=group_order)) %>%           # convert groups column to factor data type to specify ordering
  ggplot(mapping=aes(x=Group_f, y=Norm_counts, col=Group_f)) +    # make stripcharts faceted by gene
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 

We only have two points per group here but if there were more points you could overlay an error bar with geom_errorbar() or combine a geom_boxplot() with the geom_jitter(), as shown for the stripcharts in the tutorial here.

Exercise

  • Starting from norm_counts make a workflow using %>% that creates stripcharts for the genes Trp53, Brca1, Brca2.
---
title: "Introduction to R"
author: "Maria Doyle"
date: "`r format(Sys.time(), '%d %B %Y')`"
output: 
  html_notebook:
    toc: yes
    toc_float: yes
    toc_depth: 4
subtitle: visualising RNA-seq data with stripcharts
---

#### Acknowledgements
Material created by Peter Mac Data Science.

## Objectives
  
* Demonstrate how to create stripcharts with ggplot2's `geom_jitter()`
* Demonstrate the use of faceting
* Demonstrate using tidyr's `gather()` to convert from wide to long (tidy) format
* Demonstrate dplyr's `select()` to choose columns and `filter()` to choose rows
* Demonstrate use of dplyr's `case_when()` instead of multiple `ifelse()`s
* Demonstrate the powerful pipe `%>%`


## Introduction

In this tutorial, we will learn some R through creating stripcharts to visualise results from an RNA-seq experiment.

Stripcharts can be used to visualise the expression of samples by groups. They are used in RNA-seq analyses to check the top differentially expressed genes (or favourite genes). They enable us to see if the replicates within groups have similar expression values and to compare expression values between groups. In RNA-seq we plot the normalised count values.

### RNA-seq dataset

We will again use the published RNA-seq data from the Nature Cell Biology paper by [Fu et al. 2015](https://www.ncbi.nlm.nih.gov/pubmed/25730472). This study examined expression in basal and luminal cells from mice at different stages (virgin, pregnant and lactating).

Here we will work with the normalised counts for all 12 samples.
![](images/mouse_exp.png)

First we will load in the libraries we need. These are already installed for you but if you need to install it on your own computer you can use `install.packages('tidyverse')`. 

```{r}
library(tidyverse)
```

```{r results = 'hide'}
norm_counts <- read_tsv("/training/r-intro-tidyverse/data/limma-voom_normalised_counts.tsv.gz")
```

Let's take a look at the `norm_counts` data.

```{r}
norm_counts
```

How many rows and columns are in `norm_counts`?

## Stripcharts of multiple genes

We can make stripcharts to view the expression of multiple genes. Let's plot the expression of the top 10 differentially expressed genes in luminal cells from the pregnant mice versus the luminal cells from the lactating mice. In the volcano plot tutorial we showed how to get these top 10 gene symbols. Here we will show how to create an object with the symbols manually. Remember, in R we use `c()` to combine multiple values.

```{r}
top10_syms <- c("Csn1s2b", "Slc25a1", "Slc34a2", "Atp2b2", "Acacb", "Slc30a2", "Elovl5", "Egf", "Ceacam10", "Pmvk")
```

Then we extract the normalised counts information for these genes from the `norm_counts` file.

## Filtering rows with `filter()`

We can use dplyr's `filter()` to filter rows. Let's take a look at how `filter()` works. 

To use `filter()` we specify the data and the column(s) we want to use to filter with our criteria. 

If we wanted to filter for rows (genes) in the MCL1.DG sample that have expression above a threshold (e.g. 5) we would write below.
```{r}
filter(norm_counts, MCL1.DG > 5)
```

If we wanted to filter for genes that have an expression value above 5 in both the basal virgin samples (MCL1.DG and MCL1.DH) we specify both columns and the criteria. Note that we need to specify the threshold for each column e.g `MCL1.DG > 5 & MCL1.DH > 5` and not `MCL1.DG & MCL1.DH > 5`.
```{r}
filter(norm_counts, MCL1.DG > 5 & MCL1.DH > 5)
```


To filter this dataset for the rows that contain the Csn1s2b gene we would write below. Note that we use a `==` when we want to test if a value matches exactly.
```{r}
filter(norm_counts, SYMBOL == "Csn1s2b")
```
If we wanted to filter for 2 genes we could write that as below. Here we use "|" which means or as we want the row if the SYMBOL column contains Csn1s2b or Slc25a1.

```{r}
filter(norm_counts, SYMBOL == "Csn1s2b" | SYMBOL == "Slc25a1")
```

#### Exercise

* Try to filter for all rows containing casein in the GENENAME column. Hint: Take a look at the help for `str_detect()` a function from the tidyverse **stringr** package.

If we want to filter the rows for our top 10 genes. We use `SYMBOL %in% top10_syms` which means we want all the rows where the SYMBOL column contains one of the gene symbols in `top10_syms`. Remember, in R we use `%in%` to test if a value is in a set of values.

```{r}
top10_counts <- filter(norm_counts, SYMBOL %in% top10_syms)
```

Take a look.

```{r}
top10_counts
```


## Selecting columns with `select()`

We don't need the ENTREZID and GENENAME columns, we are only going to use the SYMBOL column and the counts so we can use `select()` to remove these columns.

`filter()` is used to choose rows, to choose columns we use `select()`. Let's take a look at how `select()` works.

If we wanted to select the gene symbol column we would write below.

```{r}
select(top10_counts, SYMBOL)
```

If we wanted the SYMBOL column and the MCL1.DG sample column we would write below.

```{r}
select(top10_counts, SYMBOL, MCL1.DG)
```

We can select column ranges with `:`.

```{r}
select(top10_counts, ENTREZID:GENENAME)
```

There are also useful helper functions you can use inside `select()`, such as `contains()`, `starts_with()` and `ends_with()`.

#### Exercise

* Try to select all (and only) the counts columns. Hint: There is more than one way to do it.

We can also use `select()` to remove columns by specifying a "-"" before the column name(s).

Let's remove the ENTREZID and GENENAME columns so we only keep the SYMBOL column and the sample expression values.

```{r}
top10_counts <- select(top10_counts, -ENTREZID, -GENENAME)
top10_counts
```

> The base R way of selecting columns 
> 
> As mentioned in the tidyverse tutorial 
> [here](https://rawgit.com/bioinformatics-core-shared-training/r-intermediate/master/2.dplyr-intro-live-coding-script.html), 
> in base R we can select columns using `$` or `[]`, for example, to select the SYMBOL column we could use 
> `top10_counts$SYMBOL` or `top10_counts[, "SYMBOL"]`. The `$` operator works well for single columns, but for multiple columns
> it quickly starts to get cumbersome as we need to use the `[]` operator and `c()` for combining the required columns. The column 
> names also need quotation marks. For example, to access both the SYMBOL and GENENAME 
> columns we would use `top10_counts[, c("SYMBOL", "GENENAME")]` whereas with tidyverse's dplyr it is a lot more intuitive 
> `select(top10_counts, SYMBOL, GENENAME)`.


## Converting wide into tidy format with `gather()`

To more easily plot with ggplot2 we need to change the data into ["tidy data"](https://en.wikipedia.org/wiki/Tidy_data). There are 3 rules of tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell. 
  
In our expression table there should be just one column for all expression values instead of multiple columns. We can use tidyr's `gather()` to easily change the format.

The SYMBOL information is already in a column so we use "-" to say leave that column alone. Then `gather()` will reformat *all the other columns* (the expression values) into two columns "key" and "value". The "key" column will contain the column names (our sample ids), and the "value" column will contain the values from the columns (our expression values). We will give the key column the name "Sample" and the value column the name "Norm_counts".

```{r}
# change to tidy data format (all expression values in one column)
top10_counts <- gather(top10_counts, key = Sample, value = Norm_counts, -SYMBOL)
```

Take a look.
```{r}
top10_counts
```

Take a closer look with `View()`
```{r eval=FALSE}
View(top10_counts)
```

#### Exercise

 * Try running `gather()` on the `norm_counts` object and save it as an object called `testing` (i.e. run `testing <- gather(norm_counts)`. What do you think of the output? Can you improve it so that there is a column with sample ids and a column with counts.
 * `spread()` is the opposite of `gather()`. Try running `spread()` on the `top10_counts` object and see if you can regenerate the table with samples in columns.

## Testing multiple conditions with `case_when()`

We want to plot and compare the expression in the groups, so we use `mutate()`to add a column called "Group" to say what group each sample belongs to.  Here we use dplyr's `case_when()` to say if the same id matches are conditions then add the appropriate group name into the Group column. We could use `ifelse()` as we did in the volcano plot tutorial. However, when there are many conditions to test, as there are here, `case_when()` is easier to use.

```{r}
top10_counts <- mutate(top10_counts, Group=case_when(
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       ))
```

Take a look at the data again with `View()`
```{r}
View(top10_counts)
```


#### Exercise

* Try to use `case_when()` to add a column called CellType. This column should contain the value basal if the Group column contains the word basal, or luminal if the Group contains luminal. Save it as an object called `testing`. Hint: Use `str_detect()` inside `case_when()`.


## Creating stripcharts with `geom_jitter()`

Now we can make a stripchart. We plot the Group on the X axis and the Norm_counts on the y axis. We will use `+ geom_jitter()` to create a jitter plot. A jitter plot is similar to a scatter plot but adds a small amount of random variation to the location of each point so they don't overlap. Why do we not just use a scatter plot? Let's take a look. 

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_point()
```

Some of the points are overlapping so we use `geom_jitter()` to avoid having overlapping points.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter()
```

## Creating stripcharts for each gene with `facet_wrap()`

The points are no longer overlapping, however, this is all the genes in one plot. ggplot2 has a really useful feature called faceting that we can use. `facet_wrap()` will create plots for every value in a column in our data. We would like a stripchart of expression values for each gene so we add ` + facet_wrap(~SYMBOL)` to say we want to a plot for each value in the SYMBOL column. There is also `facet_grid()` which is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL)
```

We can change the number of rows (`nrows`) or columns (`ncols`) to balance the plot. Let's try 2 rows (`nrow=2`).

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)
```


We can add `col=Group` to say we want to colour by the groups (the Group column we added).

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, col=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)
```

We can add `+ theme(axis.text.x = element_text(angle = 90)` to make the x axis labels vertical so they don't overlap.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, col=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 
```

## Ordering categories along an axis

The groups have been plotted in alphabetical order on the x axis, however, we may want to change the order. We may prefer to plot the groups in order of stage, for example, basal virgin, basal pregnant, basal lactate, luminal virgin, luminal pregnant, luminal lactate. In the volcano plot tutorial we showed how to change the order of items in the legend with `breaks=` into the scale layer. Let's try that here.

First let's make an object with the group order that we want.
```{r}
group_order <- c("basal virgin", "basal pregnant", "basal lactate", "luminal virgin", "luminal pregnant", "luminal lactate")
```

Then let's add this `group_order` into `breaks=`.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, col=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_colour_discrete(breaks=group_order)
```

That reordered the legend but notice that it didn't reorder the groups on the x axis. If we want to reorder groups along an axis with ggplot we need to make the column with the groups into an R data type called a `factor`. Factors in R are used to specify categories.

We'll add another column called "Group_f" and make the Group into a factor and specify what order we want the levels (names of the categories).

```{r}
top10_counts <- mutate(top10_counts, Group_f=factor(Group, levels=group_order))
```

Take a look with `View()`.
```{r eval=FALSE}
View(top10_counts)
```

The Group and the Group_f column look the same but take a look by typing `top10_counts`.

```{r}
top10_counts
```

Notice that the Group column has <chr> under the heading, it is a character data type, while the Group_f column has <fct> under the heading, it is a factor data type. The `str()` command is useful to check the data types in objects.

```{r}
str(top10_counts)
```

`str()` shows us Group_f column is a Factor with 6 levels. To see the levels we can use `levels()`. To check the levels of the Group_f columns we could use `select()` to select the Group_f column followed by `pull()` to pull the values out of column format.

```{r}
levels(pull(select(top10_counts, Group_f)))
```

However, in this case it's simpler to use the base R method, as we can use `top10_counts$Group_f` to access the values in the Group_f column directly.

```{r}
levels(top10_counts$Group_f)
```


We can then change our plot to use the "Group_f" column instead of Group (change `x=` and `col=`).

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group_f, y=Norm_counts, col=Group_f)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 
```

Now both the legend and the x axis have the groups in the order that we want.

These are the top genes in the comparison of luminal cells from pregnant vs lactating mice and this type of plot enables us to see what the expression values look like in all the groups. We can see that some genes, such as Pmvk, have more similar expression across all the groups than others, such as Csn1s2b.

Notice that the genes have also been plotted in alphabetical order in the facets. If we wanted to plot these genes in the order of most signficant, then we need to make symbol column into a factor as we did for the groups.


#### Exercise

* Plot the genes in order of the significance. Hint: Use mutate to add a column called SYMBOL_f containing SYMBOL as a factor with the levels in the order in `top10_syms`. Then remke the plot using SYMBOL_f in facet_wrap instead of SYMBOL.


## Creating workflows using `%>%`

One of the most useful and powerful things about the tidyverse is dplyr's pipe operator `%>%`. This enables you to chain commmands together into workflows. For example, we could chain together the commands we ran earlier on our `top10_counts` object. Note that like the `+` we use to add layers to a ggplot, `%>%` goes at the end of the line and this 'pipes' the output into the next command. If we use the `%>%` we also don't need to specify the data inside the individual commands, we only need to specify it at the beginning e.g. instead of `filter(norm_counts, SYMBOL %in% top10_syms)` we use `norm_counts %>% filter(SYMBOL %in% top10_syms)`. We can also pipe outputs directly into `ggplot()`. The pipe is very useful but some advice on when not to use the pipe is provided by the tidyverse creator Hadley Wickham in the excellent [R for Data Science book](https://r4ds.had.co.nz/pipes.html#when-not-to-use-the-pipe).

```{r}
norm_counts %>%
  filter(SYMBOL %in% top10_syms) %>%                             # filter rows for the top 10 genes
  select(-ENTREZID, -GENENAME) %>%                               # remove ENTREZID and GENENAME columns
  gather(Sample, Norm_counts, -SYMBOL) %>%                       # convert from wide to long (tidy) format
  mutate(Group=case_when(                                        # add a column to specify groups
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       )) %>%
  mutate(Group_f=factor(Group, levels=group_order)) %>%           # convert groups column to factor data type to specify ordering
  ggplot(mapping=aes(x=Group_f, y=Norm_counts, col=Group_f)) +    # make stripcharts faceted by gene
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 
```
We only have two points per group here but if there were more points you could overlay an error bar with `geom_errorbar()` or combine a `geom_boxplot()` with the `geom_jitter()`, as shown for the stripcharts in the tutorial [here](https://www.bioinformatics.babraham.ac.uk/training/ggplot_course/Introduction%20to%20ggplot.pdf).

#### Exercise

* Starting from `norm_counts` make a workflow using `%>%` that creates stripcharts for the genes Trp53, Brca1, Brca2.

